Build a high-performance crawler with Rust async

Build a high-performance async web crawler in Rust using Tokio, reqwest, and scraper. Modular design ensures concurrency, safety, and scalability. Features include request scheduling, HTML parsing, URL deduplication, and fault tolerance. Ideal for large-scale data extraction with compile-time memory safety guarantees.

2025-07-28


In the previous three articles (Asynchronous Programming with async, [Deep Dive into Rust’s async Mechanism](https://xx/Rust async Technical Deep Dive), and Practical Applications of async), we progressively explored asynchronous programming—from its fundamental concepts to Rust’s async implementation and real-world applications. In this article, we continue with the theme of asynchronous programming by applying the discussed concepts to a practical project: building a high-performance web crawler using Rust async.

Why Rust?

Rust is a modern systems programming language focused on safety, speed, and concurrency. Its performance rivals C/C++, and its built-in concurrency model, combined with Tokio’s mature async runtime, effortlessly supports millions of concurrent coroutines. Additionally, Rust guarantees memory and thread safety at compile time, eliminating GC-related risks and avoiding common issues like memory leaks and crashes in crawler systems. Thus, Rust is exceptionally well-suited for developing high-performance crawlers.

However, these advantages also come with a learning curve. If you only need to scrape a few pages without high performance demands, Rust might not be the best choice—Python or Go could meet your needs more quickly.

The following dependencies are recommended for building a crawler system in Rust:

| Requirement | Recommended Crate |
| --- | --- |
| Async Runtime | tokio |
| HTTP Client | reqwest |
| HTML Parsing | scraper |
| URL Deduplication | dashmap |
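A corresponding Cargo.toml dependency section might look like the following; the version numbers are illustrative assumptions rather than pinned requirements, and async-trait is included because later modules define async traits:

```toml
[dependencies]
tokio = { version = "1", features = ["full"] }
reqwest = "0.12"
scraper = "0.20"
dashmap = "6"
async-trait = "0.1"
```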

Architecture Design

To construct a high-performance async crawler system, we modularize its design into the following components:

- Requester: builds standardized HTTP requests
- Fetcher: executes requests asynchronously
- Parser: extracts target data and discovers new links from HTML
- Pipeline: stores the structured results
- Scheduler: dispatches tasks and deduplicates URLs

Module Implementation

Requester

The Requester module constructs standardized HTTP requests using reqwest::RequestBuilder to set headers, User-Agent, and other attributes.

Define a Requester struct with properties like user_agent and delay_ms (for request delays). Example code:

use reqwest::{Client, RequestBuilder};
use std::time::Duration;

#[derive(Clone)]
pub struct Requester {
    client: Client,
    pub user_agent: String,
    pub delay_ms: u64,
}

impl Requester {
    pub fn new(user_agent: &str, delay_ms: u64) -> Self {
        let client = Client::builder()
            .user_agent(user_agent)
            .build()
            .expect("failed to build HTTP client");

        Self {
            client,
            user_agent: user_agent.to_string(),
            delay_ms,
        }
    }

    pub fn build_request(&self, url: &str) -> RequestBuilder {
        self.client.get(url)
    }
}

Fetcher

The Fetcher asynchronously executes HTTP requests using tokio and reqwest.

Define a Fetcher struct that utilizes Requester to perform requests. Example code:

use crate::requester::Requester; 
use tokio::time::{sleep, Duration};

pub struct Fetcher;

impl Fetcher {
    pub async fn fetch(requester: &Requester, url: &str) -> Option<String> {
        sleep(Duration::from_millis(requester.delay_ms)).await;

        let request = requester.build_request(url);

        match request.send().await {
            Ok(resp) => match resp.text().await {
                Ok(body) => Some(body),
                Err(e) => {
                    eprintln!("failed to read response: {e:?}");
                    None
                }
            },
            Err(e) => {
                eprintln!("failed to request: {e:?}");
                None
            }
        }
    }
}

Parser

The Parser processes HTML responses to extract target data and new links, leveraging scraper for parsing.

To accommodate diverse websites, abstract the Parser as a trait and implement specific parsers for different tasks. Example code:

use async_trait::async_trait;

#[async_trait]
pub trait Parser: Send + Sync {
    async fn parse(&self, html: &str) -> ParseResult;
}

pub struct ParseResult {
    pub data: Vec<String>,
    pub new_links: Vec<String>
}

Pipeline

The Pipeline asynchronously stores structured data, such as saving it to a database.

Given the variety of storage options, define Pipeline as a trait, generic over the stored item type. Example code:

use async_trait::async_trait;

#[async_trait]
pub trait Pipeline<T: Send + 'static>: Send + Sync {
    async fn process(&self, data: Vec<T>);
}

Scheduler

The Scheduler manages crawler task scheduling, serving as the core module. It uses tokio::sync::mpsc for task queues and dashmap for URL deduplication.

use tokio::sync::mpsc::Sender;
use dashmap::DashSet;
use std::sync::Arc;

pub struct Scheduler {
    seen: Arc<DashSet<String>>,
    sender: Sender<String>,
}

impl Scheduler {
    pub fn new(sender: Sender<String>) -> Self {
        Self {
            seen: Arc::new(DashSet::new()),
            sender,
        }
    }

    pub fn try_enqueue(&self, url: String) {
        if self.seen.insert(url.clone()) {
            let _ = self.sender.try_send(url);
        }
    }
}

Robustness Enhancements

While the above modules form the foundation, keeping a crawler stable and efficient over long runs requires additional hardening: per-request timeouts, retries with backoff for transient failures, rate limiting to stay polite toward target sites, and graceful handling of malformed responses.

Performance Optimization

As the crawler scales, performance can be enhanced through connection reuse (reqwest's Client maintains a connection pool by default, so share one Client rather than rebuilding it per request), bounding the number of in-flight requests so memory use stays predictable, and batching writes inside the Pipeline.

Conclusion

Rust’s focus on safety, speed, and concurrency provides system-level performance and memory safety guarantees. Coupled with its vibrant ecosystem, Rust empowers developers to build high-performance, stable, and scalable crawler systems.

This article dissected a crawler system into modular components, addressed robustness challenges, and proposed optimizations to meet evolving demands.